Bayesian Online Changepoint Detection
Ryan Prescott Adams
Cavendish Laboratory
Cambridge CBS OHE
United Kingdom
David J.C. MacKay
Cavendish Laboratory
Cambridge CBS OHE
United Kingdom
Abstract
Changepoints are abrupt variations in the
generative parameters of a data sequence.
OnUne detection of changepoints is useful in
modehing and prediction of time series in
apphcation areas such as finance, biomet-
rics, and robotics. While frequentist meth-
ods have yielded online filtering and predic-
tion techniques, most Bayesian papers have
focused on the retrospective segmentation
problem. Here we examine the case where
the model parameters before and after the
changepoint are independent and we derive
an online algorithm for exact inference of the
most recent changepoint. We compute the
probability distribution of the length of the
current "run," or time since the last change-
point, using a simple message-passing algo-
rithm. Our implementation is highly modu-
lar so that the algorithm may be applied to
a variety of types of data. We illustrate this
modularity by demonstrating the algorithm
on three different real-world data sets.
1 INTRODUCTION
Changepoint detection is the identification of abrupt
changes in the generative parameters of sequential
data. As an online and offline signal processing tool, it
has proven to be useful in applications such as process
control [1], EEC analysis [5, 2, 17], DNA segmentation
[6], econometrics [7, 18], and disease demographics [9].
Frequentist approaches to changepoint detection, from
the pioneering work of Page [22, 23] and Lorden [19]
to recent work using support vector machines [10], of-
fer online changepoint detectors. Most Bayesian ap-
proaches to changepoint detection, in contrast, have
been offline and retrospective [24, 4, 26, 13, 8]. With a
few exceptions [16, 20] , the Bayesian papers on change-
point detection focus on segmentation and techniques
to generate samples from the posterior distribution
over changepoint locations.
In this paper, we present a Bayesian changepoint de-
tection algorithm for online inference. Rather than
retrospective segmentation, we focus on causal predic-
tive filtering; generating an accurate distribution of the
next unseen datum in the sequence, given only data al-
ready observed. For many applications in machine in-
telligence, this is a natural requirement. Robots must
navigate based on past sensor data from an environ-
ment that may have abruptly changed: a door may
be closed now, for example, or the furniture may have
been moved. In vision systems, the brightness change
when a light switch is flipped or when the sun comes
out.
We assume that a sequence of observations
xi,X2,---,xt may be divided into non-overlapping
product partitions [3]. The delineations between
partitions are called the changepoints. We further
assume that for each partition p, the data within it
are i.i.d. from some probability distribution P{xt \ rjp).
The parameters rjp, p — 1,2,... are taken to be i.i.d.
as well. We denote the contiguous set of observations
between time a and h inclusive as Xaj). The discrete
a priori probability distribution over the interval
between changepoints is denoted as Pgapig)-
We are concerned with estimating the posterior dis-
tribution over the current "run length," or time since
the last changepoint, given the data so far observed.
We denote the length of the current run at time t by
rt- We also use the notation xl ' to indicate the set
of observations associated with the run rt. As r may
be zero, the set x'^'^^ may be empty. We illustrate the
relationship between the run length r and some hypo-
thetical univariate data in Figures 1(a) and 1(b).
t.
Qz
93
• •
•
•
• •
H — \ — \ — h
H — \ — \ — \ — \ — h
H — ^ — ^ — \ — ►
(a)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 t
(b)
Figure 1: This figure illustrates how we describe a
changepoint model expressed in terms of run lengths.
Figure 1 (a) shows hypothetical univariate data divided
by changepoints on the mean into three segments of
lengths gi = 4, 52 = 6, and an undetermined length
513. Figure 1(b) shows the nm length rj as a function
of time, ft drops to zero when a changepoint occurs.
Figure 1(c) shows the trellis on which the message-
passing algorithm lives. Solid lines indicate that prob-
ability mass is being passed "upwards," causing the
run length to grow at the next time step. Dotted lines
indicate the possibility that the current run is trun-
cated and the run length drops to zero.
2 RECURSIVE RUN LENGTH
ESTIMATION
We assume that we can compute the predictive distri-
bution conditional on a given run length rt- We then
integrate over the posterior distribution on the current
run length to find the marginal predictive distribution:
P{Xt+l I Xi-t)
Y,P{xt+i\rt,x['-^)P{n\x,..t) (1)
To find the posterior distribution
P{rt,xi:t)
Pin I Xl:t) =
P{xv.t)
(2)
we write the joint distribution over run length and
observed data recursively.
P{rt,Xi:L} = ^ P{rt,rt-i,xi:t)
rt-i
= P{rt,xt\rt-i,xi:t-i)P{rt-i,xi:t-i) (3)
rt-l
= ^ P{rt I rt-i)Pixt I rt-i,xi''>)P{rt-i,Xi..t-i)
n-i
Note that the predictive distribution P{xt \rt-i,Xi;t)
(r)
depends only on the recent data x). . We can thus
generate a recursive message-passing algorithm for the
joint distribution over the current run length and the
data, based on two calculations: 1) the prior over
given rt-i, and 2) the predictive distribution over the
newly-observed datum, given the data since the last
changepoint.
2.1 THE CHANGEPOINT PRIOR
The conditional prior on the changepoint P{rt \rt-i)
gives this algorithm its computational efficiency, as it
has nonzero mass at only two outcomes: the run length
either continues to grow and rt = rt-i -|- 1 or a change-
point occurs and rt = 0.
P{rt\rt-i)
if rt =
if rt = rt-i + 1
otherwise
The function -ff(r) is the hazard function. [11].
Hir) =
■Pgap(ff=T)
(4)
(5)
In the special case is where Pgap(.g) is a discrete expo-
nential (geometric) distribution with timescale A, the
process is memoryless and the hazard function is con-
stant at H{t) = 1/A.
Figure 1(c) illustrates the resulting message- passing
algorithm. In this diagram, the circles represent run-
length hypotheses. The lines between the circles show
recursive transfer of mass between time steps. Solid
lines indicate that probability mass is being passed
"upwards," causing the run length to grow at the next
time step. Dotted lines indicate that the current run
is truncated and the run length drops to zero.
1. Initialize
P(ro) = S{r) or P(ro=0) = 1
(0) _
(0)
1 A. prior
2. Observe New Datum xt
3. Evaluate Predictive Probability
4. Calculate Growth Probabilities
P(rt=rt_i+l,a;i:t) = P{rt-i,xv.t-i)4"\^-H{rt-i))
5. Calculate Changepoint Probabilities
P{rt=0,xv.t) = J2P{rt-i,xv.t-i)ni^^H{rt-i)
rt-l
6. Calculate Evidence
P(a;i:t) =^P(rt,a^i:t)
rt
7. Determine Run Length Distribution
P{rt\xi..t) = P{rt,xi..t)/P{xi:t)
8. Update Sufficient Statistics
(0)
V - . Xprior
9. Perform Prediction
10. Return to Step 2
2.2 BOUNDARY CONDITIONS
A recursive algorithm must not only define the recur-
rence relation, but also the initialization conditions.
We consider two cases: 1) a changepoint occurred a
priori before the first datum, such as when observing
a game. In such cases we place all of the probability
mass for the initial run length at zero, i.e. P(ro=0) = 1.
2) We observe some recent subset of the data, such as
when modelling climate change. In this case the prior
over the initial run length is the normalized survival
function [11]
P(ro=T) = ^S{t), (6)
where Z is an appropriate normalizing constant, and
oo
S{r) = E ^gap(5=t). (7)
t=T+l
2.3 CONJUGATE-EXPONENTIAL
MODELS
Conjugate-exponential models arc particularly conve-
nient for integrating with the changepoint detection
scheme described here. Exponential family likelihoods
allow inference with a finite number of sufficient statis-
tics which can be calculated incrementally as data ar-
rives. Exponential family likelihoods have the form
P{x\r]) = h{x)exp{r]''U{x) - A{r])) (8)
where
A{r]) = log j dr) h{x) exp (r/T[/(a;)) . (9)
The strength of the conjugate-exponential representa-
tion is that both the prior and posterior take the form
of an exponential-family distribution over rj that can
be summarized by succinct hyperparameters v and %.
P{n\x, i") = Hv) exp (ry^x - i'A(ry) - A{x,
(10)
We wish to infer the parameter vector rj associated
with the data from a current run length rt ■ We denote
(r)
this run-specific model parameter as rjl '. After find-
(r) {r)\
ing the posterior distribution P{ril \rt,x^ ), we can
marginalize out the parameters to find the predictive
distribution, conditional on the length of the current
run.
P{xt+i \rt) = J dr, P{xt+i I r,)Piv['^^ \ r*, a;*''))
(11)
Algorithm 1: The online changepoint algorithm with
prediction. An additional optimization not shown is
to truncate the per-timestep vectors when the tail of
Pirtlxiit) has mass beneath a threshold.
This marginal predictive distribution, while gener-
ally not itself an exponential-family distribution, is
usually a simple function of the sufficient statistics.
When exact distributions are not available, compact
approximations such as that described by Snelson and
Ghahramani [25] may be useful. We will only ad-
dress the exact case in this paper, where the predictive
distribution associated with a particular current run
(r) (r)
length is parameterized by and Xt ■
v't^ = Vpnox -^rt (12)
xle5
1
10-^
10-^ 2
10-*
10^
1700
1900
2300
2500
2700
2100
Time
Figure 2: The top plot is a 1100-datum subset of nuclear magnetic response during the drilling of a well. The
data are plotted in light gray, with the predictive mean (solid dark line) and predictive 1-a error bars (dotted
lines) overlaid. The bottom plot shows the posterior probability of the current run P{rt \ xi;t) at each time step,
using a logarithmic color scale. Darker pixels indicate higher probability.
2.4 COMPUTATIONAL COST
The complete algorithm, assuming exponential-family
likelihoods, is shown in Algorithm 1. The space- and
time-complexity per time-step are linear in the number
of data points so far observed. A trivial modification of
the algorithm is to discard the run length probability
estimates in the tail of the distribution which have a
total mass less than some threshold, say 10"''. This
yields a constant average complexity per iteration on
the order of the expected run length E[r], although
the worst-case complexity is still linear in the data.
3 EXPERIMENTAL RESULTS
In this section we demonstrate several implementa-
tions of the changepoint algorithm developed in this
paper. We examine three real-world example datasets.
The first case is a varying Gaussian mean from well-
log data. In the second example we consider abrupt
changes of variance in daily returns of the Dow Jones
Industrial Average. The final data are the intervals
between coal mining disasters, which we model as a
Poisson process. In each of the three examples, we use
a discrete exponential prior over the interval between
changepoints.
3.1 WELL-LOG DATA
rock surrounding the well. The variations in mean
reflect the stratification of the earth's crust. These
data have been studied in the context of changepoint
detection by O Ruanaidh and Fitzgerald [21], and by
Fearnhead and Chfford [12].
The changepoint detection algorithm was run on these
data using a univariate Gaussian model with prior pa-
rameters n = 1.15 X 10^ and a = Ix 10"'. The rate of
the discrete exponential prior, Agap, was 250. A sub-
set of the data is shown in Figure 2, with the predic-
tive mean and standard deviation overlaid on the top
plot. The bottom plot shows the log probability over
the current run length at each time step. Notice that
the drops to zero run-length correspond well with the
abrupt changes in the mean of the data. Immediately
after a changepoint, the predictive variance increases,
as would be expected for a sudden reduction in data.
3.2 1972-75 DOW JONES RETURNS
During the three year period from the middle of 1972
to the middle of 1975, several major events occurred
that had potential macroeconomic effects. Significant
among these are the Watergate affair and the OPEC
oil embargo. We applied the changepoint detection
algorithm described here to daily returns of the Dow
Jones Industrial Average from July 3, 1972 to June 30,
1975. We modelled the returns
These data are 4050 measurements of nuclear magnetic
response taken during the drilling of a well. The data
are used to interpret the geophysical structure of the
(14)
500
400
□1
300
3
200
i
IQO
1
10-^
10'* .w;
10^1
10 ■
10"^
30 January 1973:
Cbnvitt^d
19 October 1973:
apEli. Erntujrgs Begins
9 August 1974:
Figure 3: The top plot shows daily returns on the Dow Jones Industrial Average, with an overlaid plot of the
predictive volatility. The bottom plot shows the posterior probability of the current run length P{rt \ Xi:t) at
each time step, using a logarithmic color scale. Darker pixels indicate higher probability. The time axis is in
business days, as this is market data. Three events are marked: the conviction of G. Gordon Liddy and James
W. McCord, Jr. on January 30, 1973; the beginning of the OPEC embargo against the United States on October
19, 1973; and the resignation of President Nixon on August 9, 1974.
(where is the daily closing price) with a zero-
mean Gaussian distribution and piecewise-constant
variance. Hsu [14] performed a similar analysis on a
subset of these data, using frequentist techniques and
weekly returns.
We used a gamma prior on the inverse variance, with
a = 1 and b = 10~^. The exponential prior on change-
point interval had rate Agap = 2-50. In Figure 3, the 4 DISCUSSION
top plot shows the daily returns with the predictive
standard deviation overlaid. The bottom plot shows
the posterior probability of the current run length,
P{rt\xi:t). Three events are marked on the plot:
the conviction of Nixon re-election officials G. Gordon
Liddy and James W. McCord, Jr., the beginning of
the oil embargo against the United States by the Or-
ganization of Petroleum Exporting Countries (OPEC),
and the resignation of President Nixon.
3.3 COAL MINE DISASTER DATA
These data from Jarrett [15] are dates of coal mining
explosions that killed ten or more men between March
15, 1851 and March 22, 1962. We modelled the data
as an Poisson process by weeks, with a gamma prior
on the rate with a = b = 1. The rate of the exponen-
Possion process determines the local average of the
slope. The posterior probability of the current run
length is shown in the bottom plot. The introduc-
tion of the Coal Mines Regulations Act in 1887 (cor-
responding to weeks 1868 to 1920) is also marked.
This paper contributes a predictive, online interpeta-
tion of Bayesian changepoint detection and provides
a simple and exact method for calculating the poste-
rior probability of the current run length. We have
demonstrated this algorithm on three real-world data
sets with different modelling requirements.
Additionally, this framework provides convenient de-
lineation between the implementation of the change-
point algorithm and the implementation of the model.
This modularity allows changepoint-detection code to
use an object-oriented, "pluggable" type architecture.
tial prior on the changepoint inverval was Agap = 1000.
The data are shown in Figure 4. The top plot shows
the cumulative number of accidents. The rate of the
Acknowledgements
The authors would like to thank Phil Cowans and Mar-
ian Frazier for valuable discussions. This work was
funded by the Gates Cambridge Trust.
B 200
3000
Weeks
5000
Coal Mines
Regulations Act
Figure 4: These data are the weekly occurrence of coal mine disasters that killed ten or more people between
1851 and 1962. The top plot is the cumulative number of accidents. The accident rate determines the local
average slope of the plot. The introduction of the Coal Mines Regulations Act in 1887 is marked. The year 1887
corresponds to weeks 1868 to 1920 on this plot. The bottom plot shows the posterior probability of the current
run length at each time step, P{rt \ Xi-t).
References
[1] Leo A. Aroian and Howard Levene. The effec-
tiveness of quality control charts. Journal of the
American Statistical Association, 45(252):520-
529, 1950.
[2] J. S. Barlow, O. D. Creutzfeldt, D. Michael,
J. Houchin, and H. Epelbaum. Automatic
adaptive segmentation of clinical EEGs,. Elec-
troencephalography and Clinical Neurophysiology,
51(5):512-525, May 1981.
[3] D. Barry and J. A. Hartigan. Product partition
models for change point problems. The Annals of
Statistics, 20:260-279, 1992.
[4] D. Barry and J. A. Hartigan. A Bayesian analysis
of change point problems. Journal of the Ameri-
can Statistical Association, 88:309-319, 1993.
[5] G. Bodenstein and H. M. Practorius. Feature ex-
traction from the electroencephalogram by adap-
tive segmentation. Proceedings of the IEEE,
65(5):642-652, 1977.
[6] J. V. Braun, R. K. Braun, and H. G. Miiller.
Multiple changepoint fitting via quasilikelihood,
with application to DNA sequence segmentation.
Biometrika, 87(2):301-314, June 2000.
[7] Jie Chen and A. K. Gupta. Testing and locating
variance changepoints with application to stock
prices. Journal of the American Statistical Asso-
ciation, 92(438) :739-747, June 1997.
[8] Siddhartha Chib. Estimation and comparison of
multiple change-point models. Journal of Econo-
metrics, 86(2):221-241, October 1998.
[9] D. Denison and C. Holmes. Bayesian partitioning
for estimating disease risk, 1999.
[10] F. Desobry, M. Davy, and C. Doncarli. An on-
line kernel change detection algorithm. IEEE
Transactions on Signal Processing, 53(8):2961-
2974, August 2005.
[11] Merran Evans, Nicholas Hastings, and Brian
Peacock. Statistical Distributions. Wiley-
Interscience, June 2000.
[12] Paul Fearnhead and Peter Clifford. On-line in-
ference for hidden Markov models via particle fil-
ters. Journal of the Royal Statistical Society B,
65(4):887-899, 2003.
[13] P. Green. Reversible jump Markov chain Monte
Carlo computation and Bayesian model determi-
nation, 1995.
[14] D. A. Hsu. Tests for variance shift at an un-
known time point. Applied Statistics, 26(3) :279-
284, 1977.
[15] R. G. Jarrctt. A note on the intervals between
coal-mining disasters. Biometrika, 66(1):191-193,
1979.
[16] Timothy T. Jervis and Stuart I. Jardine. Alarm
system for wellbore site. United States Patent
5,952,569, October 1997.
[17] A. Y. Kaplan and S. L. Shishkin. Application of
the change-point analysis to the investigation of
the brain's electrical activity. In B. E. Brodsky
and B. S. Darkhovsky, editors, Non-Parametric
Statistical Diagnosis : Problems and Methods,
pages 333-388. Springer, 2000.
[18] Gary M. Koop and Simon M. Potter. Forecast-
ing and estimating multiple change-point models
with an unknown number of change points. Tech-
nical report. Federal Reserve Bank of New York,
December 2004.
[19] G. Lorden. Procedures for reacting to a change in
distribution. The Annals of Mathematical Statis-
tics, 42(6):1897-1908, December 1971.
[20] J. J. 6 Ruanaidh, W. J. Fitzgerald, and K. J.
Pope. Recursive Bayesian location of a discon-
tinuity in time series. In Acoustics, Speech, and,
Signal Processing, 1991 ICASSP-94., 1994 IEEE
International Conference on, volume iv, pages
IV/513-IV/516 vol.4, 1994.
[21] Joseph .1. K. O Ruanaidh and William J. Fitzger-
ald. Numerical Bayesian Methods Applied to
Signal Processing (Statistics and Computing).
Springer, February 1996.
[22] E. S. Page. Continuous inspection schemes.
Biometrika, 41(1/2):100-115, June 1954.
[23] E. S. Page. A test for a change in a parame-
ter occurring at an unknown point. Biometrika,
42(3/4):523-527, 1955.
[24] A. F. M. Smith. A Bayesian approach to infer-
ence about a change-point in a sequence of ran-
dom variables. Biom.etrika, 62(2):407 416, 1975.
[25] Edward Snelson and Zoubin Ghahramani. Com-
pact approximations to Bayesian predictive dis-
tributions. In ICML '05: Proceedings of the
22nd international conference on Machine learn-
ing, pages 840-847, New York, NY, USA, 2005.
ACM Press.
[26] D. A. Stephens. Bayesian retrospective multiple-
changepoint identification. Applied Statistics,
43:159-178, 1994.